Abstract

Inspired by Random Forests (RF) in the context of classification, we propose
a new clustering ensemble method, Cluster Forests (CF). Geometrically, CF
randomly probes a high-dimensional data cloud to obtain "good local
clusterings" and then aggregates via spectral clustering to obtain cluster
assignments for the whole dataset. The search for good local clusterings is
guided by a cluster quality measure $\kappa$. CF progressively improves each
local clustering in a fashion that resembles the tree growth in RF. Empirical
studies on several real-world datasets under two different performance
metrics show that CF compares favorably to its competitors. Theoretical
analysis shows that the $\kappa$ criterion grows each local clustering in a
desirable way: it is "noise-resistant." A closed-form expression is obtained
for the mis-clustering rate of spectral clustering under a perturbation
model, which yields new insights into some aspects of spectral clustering.



1 Motivation 

The general goal of clustering is to partition a set of data such that data points
within the same cluster are "similar" while those from different clusters are
"dissimilar." An emerging trend is that new applications tend to generate data in
very high dimensions for which traditional methodologies of cluster analysis do
not work well. Remedies include dimension reduction and feature transformation,
but it is a challenge to develop effective instantiations of these remedies in
the high-dimensional clustering setting. In particular, in datasets whose
dimension is beyond 20, it is infeasible to perform full subset selection. Also, there
may not be a single set of attributes on which the whole set of data can be
reasonably separated. Instead, there may be local patterns in which different
choices of attributes or different projections reveal the clustering.



The authors can be contacted via email at: dhyan@berkeley.edu, aiyouchen@google.com,
jordan@eecs.berkeley.edu. Aiyou Chen is currently with Google, Mountain View, CA 94043. 
Our approach to meeting these challenges is to randomly probe the data/feature
space to detect many locally "good" clusterings and then aggregate by spectral
clustering. The intuition is that, in high-dimensional spaces, there may be
projections or subsets of the data that are well separated, and these projections or subsets
may carry information about the cluster membership of the data involved. If
we can effectively combine such information from many different views (here a
view has two components: the directions or projections we are looking at and
the part of the data that is involved), then we can hope to recover the cluster
assignments for the whole dataset. However, the number of projections or subsets
that are potentially useful tends to be huge, and it is not feasible to conduct a
grand tour of the whole data space by exhaustive search. This motivates us to
randomly probe the data space and then aggregate.

The idea of random projection has been explored in various problem domains
such as clustering [12, 5], manifold learning [TH], and compressive sensing
[TU]. However, the most direct motivation for our work is the Random Forests
(RF) methodology for classification [7]. In RF, a bootstrap step selects a subset
of data while the tree growth step progressively improves a tree from the root 
downwards — each tree starts from a random collection of variables at the root 
and then becomes stronger and stronger as more nodes are split. Similarly, we 
expect that it will be useful in the context of high-dimensional clustering to go 
beyond simple random probings of the data space and to perform a controlled 
probing in hope that most of the probings are "strong." This is achieved by pro- 
gressively refining our "probings" so that eventually each of them can produce 
relatively high-quality clusters although they may start weak. In addition to 
the motivation from RF, we note that similar ideas have been explored in the 
projection pursuit literature for regression analysis and density estimation (see 
and references therein). 

RF is a supervised learning methodology and as such there is a clear goal 
to achieve, i.e., good classification or regression performance. In clustering, the 
goal is less apparent. But significant progress has been made in recent years in 
treating clustering as an optimization problem under an explicitly defined cost 
criterion; most notably in the spectral clustering methodology HI] ■ Using 
such criteria makes it possible to develop an analog of RF in the clustering 
domain. 

Our contributions can be summarized as follows. We propose a new cluster 
ensemble method that incorporates model selection and regularization. Em- 
pirically CF compares favorably to some popular cluster ensemble methods. 
We also provide some theoretical support for our work: (1) Under a simplified 
model, CF is shown to grow the clustering instances in a "noise-resistant" man- 
ner; (2) we obtain a closed-form formula for the mis-clustering rate of spectral 
clustering under a perturbation model that yields new insights into aspects of 
spectral clustering that are relevant to CF. 

The remainder of the paper is organized as follows. In Section 2 we present
a detailed description of CF. Related work is discussed in Section 3. This is
followed by an analysis of the $\kappa$ criterion and the mis-clustering rate of spectral
clustering under a perturbation model in Section 4. We evaluate our method
in Section 5 by simulations on Gaussian mixtures and comparison to several
popular cluster ensemble methods on real-world data. Finally we conclude in
Section 6.



2 The Method 

CF is an instance of the general class of cluster ensemble methods [36], and as
such it is comprised of two phases: one which creates many cluster instances 
and one which aggregates these instances into an overall clustering. We begin 
by discussing the cluster creation phase. 



2.1 Growth of clustering vectors 

CF works by aggregating many instances of clustering problems, with each
instance based on a different subset of features (with varying weights). We
define the feature space $\mathcal{F} = \{1, 2, \ldots, p\}$ as the set of indices of coordinates
in $\mathbb{R}^p$. We assume that we are given $n$ i.i.d. observations $X_1, \ldots, X_n \in \mathbb{R}^p$. A
clustering vector is defined to be a subset of the feature space.

The growth of a clustering vector is governed by the following cluster quality 
measure: 

$$\kappa(\tilde{f}) = \frac{SS_W}{SS_B}, \qquad (1)$$

where $SS_W$ and $SS_B$ are the within-cluster and between-cluster sum of squared
distances (see Section 7.2), respectively, computed on the set of features currently in use
(denoted by $\tilde{f}$).
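The quality measure can be illustrated with a minimal sketch. The precise definitions of $SS_W$ and $SS_B$ are given in Section 7.2 and are not reproduced here, so the code below assumes the common centroid-based forms (squared distances to cluster centroids, and size-weighted squared distances from centroids to the grand mean). The function name `kappa` and the toy data are illustrative only.

```python
from collections import defaultdict


def kappa(points, labels):
    """Return SS_W / SS_B for a hard clustering; smaller means better separation.

    Assumption: SS_W and SS_B use the usual centroid-based definitions,
    which may differ by constant factors from the paper's Section 7.2.
    """
    n, dim = len(points), len(points[0])
    grand_mean = [sum(p[j] for p in points) / n for j in range(dim)]

    # Group the points by their cluster label.
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)

    ss_w = 0.0
    ss_b = 0.0
    for members in clusters.values():
        m = len(members)
        centroid = [sum(p[j] for p in members) / m for j in range(dim)]
        ss_w += sum((p[j] - centroid[j]) ** 2 for p in members for j in range(dim))
        ss_b += m * sum((centroid[j] - grand_mean[j]) ** 2 for j in range(dim))
    # Raises ZeroDivisionError if all centroids coincide with the grand mean.
    return ss_w / ss_b


data = [(0.0,), (0.2,), (10.0,), (10.2,)]
tight = kappa(data, [0, 0, 1, 1])   # clusters match the two groups
loose = kappa(data, [0, 1, 0, 1])   # clusters cut across the two groups
```

A partition that matches the structure of the data gives a much smaller value than one that cuts across it, which is why the growth procedure accepts new features only when $\kappa$ decreases.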

Using this quality measure, we iteratively expand the clustering vector.
Specifically, letting $\tau$ denote the number of consecutive unsuccessful attempts
in expanding the clustering vector $\tilde{f}$, and letting $\tau_m$ be the maximal allowed
value of $\tau$, the growth of a clustering vector is described in Algorithm 1.

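Algorithm 1 The growth of a clustering vector $\tilde{f}$

1: Initialize $\tilde{f}$ to be NULL and set $\tau = 0$.
2: Apply feature competition and update $\tilde{f} \leftarrow (f_1^{(0)}, \ldots, f_b^{(0)})$.
3: repeat
4:   Sample $b$ features, denoted as $f_1, \ldots, f_b$, from the feature space $\mathcal{F}$.
5:   Apply $K$-means (the base clustering algorithm) to the data induced by the feature vector $(\tilde{f}, f_1, \ldots, f_b)$.
6:   if $\kappa(\tilde{f}, f_1, \ldots, f_b) < \kappa(\tilde{f})$ then
7:     expand $\tilde{f}$ by $\tilde{f} \leftarrow (\tilde{f}, f_1, \ldots, f_b)$ and set $\tau \leftarrow 0$
8:   else
9:     discard $\{f_1, \ldots, f_b\}$ and set $\tau \leftarrow \tau + 1$
10:  end if
11: until $\tau \ge \tau_m$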
Algorithm 2 Feature competition

1: for $i = 1$ to $q$ do
2:   Sample $b$ features, $f_1^{(i)}, \ldots, f_b^{(i)}$, from the feature space $\mathcal{F}$;
3:   Apply $K$-means to the data projected on $(f_1^{(i)}, \ldots, f_b^{(i)})$ to get $\kappa(f_1^{(i)}, \ldots, f_b^{(i)})$;
4: end for
5: Set $(f_1^{(0)}, \ldots, f_b^{(0)}) \leftarrow \arg\min_{i = 1, \ldots, q} \kappa(f_1^{(i)}, \ldots, f_b^{(i)})$.



Step 2 in Algorithm 1 is called feature competition (setting $q = 1$ reduces
to the usual mode). It aims to provide a good initialization for the growth of a
clustering vector. The feature competition procedure is detailed in Algorithm 2.

Feature competition is motivated by Theorem 1 in Section 4.1: it helps
prevent noisy or "weak" features from entering the clustering vector at the
initialization, and, by Theorem 1, the resulting clustering vector will be formed
by "strong" features which can lead to a "good" clustering instance. This will
be especially helpful when the number of noisy or very weak features is large.
Note that feature competition can also be applied in other steps in growing the
clustering vector. A heuristic for choosing $q$ is based on the "feature profile
plot," a detailed discussion of which is provided in Section 5.
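The growth procedure and feature competition can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the base clustering step is hidden behind a hypothetical callback `kappa_fn` that returns the quality measure for a tuple of feature indices, already-selected features are skipped when expanding (an assumption; the listing does not say how duplicates are handled), and growth stops after `tau_m` consecutive failures.

```python
import random


def feature_competition(p, b, q, kappa_fn, rng):
    """Draw q candidate sets of b features and keep the one with the smallest kappa."""
    candidates = [tuple(rng.sample(range(p), b)) for _ in range(q)]
    return min(candidates, key=kappa_fn)


def grow_clustering_vector(p, b, q, tau_m, kappa_fn, rng):
    """Grow a feature subset, accepting new features only if they lower kappa."""
    current = feature_competition(p, b, q, kappa_fn, rng)
    tau = 0  # number of consecutive unsuccessful attempts
    while tau < tau_m:
        sampled = rng.sample(range(p), b)
        # Assumption: features already in the vector are not added twice.
        candidate = current + tuple(f for f in sampled if f not in current)
        if kappa_fn(candidate) < kappa_fn(current):
            current, tau = candidate, 0
        else:
            tau += 1
    return current


# Hypothetical stand-in for "run K-means on these features and compute kappa":
# features 0, 1, 2 are treated as strong, all others as noise.
STRONG = {0, 1, 2}


def toy_kappa(features):
    hits = sum(1 for f in features if f in STRONG)
    noise = len(features) - hits
    return (1.0 + noise) / (1.0 + hits)


vector = grow_clustering_vector(p=10, b=2, q=5, tau_m=3,
                                kappa_fn=toy_kappa, rng=random.Random(0))
```

In a real run `kappa_fn` would apply the base clustering algorithm to the data restricted to the given features; keeping it as a callback makes the control flow of the two algorithms easy to inspect on its own.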



2.2 The CF algorithm

The CF algorithm is detailed in Algorithm 3. The key steps are: (a) grow
$T$ clustering vectors and obtain the corresponding clusterings; (b) average the
clustering matrices to yield an aggregate matrix $P$; (c) regularize $P$; and (d)
perform spectral clustering on the regularized matrix. The regularization step
is done by thresholding $P$ at level $\beta_2$, that is, setting $P_{ij}$ to be 0 if it is less
than a constant $\beta_2 \in (0, 1)$, followed by a further nonlinear transformation
$P \leftarrow \exp(\beta_1 P)$ which we call scaling.

Algorithm 3 The CF algorithm

1: for $l = 1$ to $T$ do
2:   Grow a clustering vector, $\tilde{f}^{(l)}$, according to Algorithm 1;
3:   Apply the base clustering algorithm to the data induced by clustering vector $\tilde{f}^{(l)}$ to get a partition of the data;
4:   Construct the $n \times n$ co-cluster indicator matrix (or affinity matrix) $P^{(l)}$, where
       $P^{(l)}_{ij} = 1$ if $X_i$ and $X_j$ are in the same cluster, and $P^{(l)}_{ij} = 0$ otherwise;
5: end for
6: Average the indicator matrices to get $P \leftarrow \frac{1}{T} \sum_{l=1}^{T} P^{(l)}$;
7: Regularize the matrix $P$;
8: Apply spectral clustering to $P$ to get the final clustering.
We provide some justification for our choice of spectral clustering in
Section 4.2. As the entries of matrix $P$ can be viewed as encoding the pairwise
similarities between data points, any clustering algorithm based on pairwise
similarity can be used as the aggregation engine.
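The aggregation and regularization steps can be sketched in a few lines. The sketch assumes each clustering instance is given as a list of cluster labels and stops before the final spectral clustering step; the function names are illustrative, and the parameter names `beta1` and `beta2` mirror the scaling and thresholding constants in the text.

```python
import math


def co_cluster_matrix(labels):
    """n x n indicator matrix: 1 if two points share a cluster, else 0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def aggregate(partitions, beta1, beta2):
    """Average the indicator matrices, threshold at beta2, then scale by exp(beta1 * P)."""
    T, n = len(partitions), len(partitions[0])
    P = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        M = co_cluster_matrix(labels)
        for i in range(n):
            for j in range(n):
                P[i][j] += M[i][j] / T
    # Thresholding: entries below beta2 are set to 0 before scaling.
    return [[math.exp(beta1 * (v if v >= beta2 else 0.0)) for v in row]
            for row in P]


partitions = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
A = aggregate(partitions, beta1=1.0, beta2=0.4)
```

Note that after scaling a thresholded entry becomes $\exp(0) = 1$ rather than 0, so the regularized matrix separates weak and strong co-association by ratio rather than by sparsity. The resulting matrix would then be passed to a spectral clustering routine.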

3 Related Work 

In this section, we compare and contrast CF to other work on cluster ensembles. 
It is beyond our scope to attempt a comprehensive review of the enormous body 
of work on clustering, please refer to [531 [H] for overview and references. We will 
also omit a discussion on classifier ensembles, see [IB] for references. Our focus 
will be on cluster ensembles. We discuss the two phases of cluster ensembles, 
namely, the generation of multiple clustering instances and their aggregation, 
separately. 

For the generation of clustering instances, there are two main approaches:
data re-sampling and random projection. [11] and [25] produce clustering
instances on bootstrap samples from the original data. Random projection is used
by [12] where each clustering instance is generated by randomly projecting the
data to a lower-dimensional subspace. These methods are myopic in that they
do not attempt to use the quality of the resulting clusterings to choose samples
or projections. Moreover, in the case of random projections, the choice of the
dimension of the subspace is myopic. In contrast, CF proceeds by selecting
features that progressively improve the quality (measured by $\kappa$) of individual
clustering instances in a fashion resembling that of RF. As individual clustering
instances are refined, better final clustering performance can be expected. We
view this non-myopic approach to generating clustering instances as essential
when the data lie in a high-dimensional ambient space. Another possible
approach is to generate clustering instances via random restarts of a base clustering
algorithm such as $K$-means [TJ.

The main approaches to aggregation of clustering instances are the
co-association method [SHI [IS] and the hyper-graph method [33]. The co-association
method counts the number of times two points fall in the same cluster in the
ensemble. The hyper-graph method solves a $k$-way minimal cut hyper-graph
partitioning problem where a vertex corresponds to a data point and a link is
added between two vertices each time the two points meet in the same cluster.
Another approach is due to [37], who propose to combine clustering instances 
with mixture modeling where the final clustering is identified as a maximum 
likelihood solution. CF is based on co-association, specifically using spectral 
clustering for aggregation. Additionally, CF incorporates regularization such 
that the pairwise similarity entries that are close to zero are thresholded to 
zero. This yields improved clusterings as demonstrated by our empirical stud- 
ies. 

A different but closely related problem is clustering aggregation [17] which
requires finding a clustering that "agrees" as much as possible with a given
set of input clustering instances. Here these clustering instances are assumed
to be known and the problem can be viewed as the second stage of clustering
ensemble. Also related is ensemble selection [H [131 E] which is applicable to
CF but is not the focus of the present work. Finally there is unsupervised
learning with random forests [3 [3S] where RF is used for deriving a suitable
distance metric (by synthesizing a copy of the data via randomization and using
it as the "contrast" pattern); these methods are fundamentally different from ours.

4 Theoretical Analysis 

In this section, we provide a theoretical analysis of some aspects of CF. In 
particular we develop theory for the $\kappa$ criterion, presenting conditions under
which CF is "noise-resistant." By "noise-resistant" we mean that the algorithm 
can prevent a pure noise feature from entering the clustering vector. We also 
present a perturbation analysis of spectral clustering, deriving a closed-form 
expression for the mis-clustering rate. 

4.1 CF is noise-resistant 

We analyze the case in which the clusters are generated by a Gaussian mixture:

$$\Delta \, \mathcal{N}(\mu, \Sigma) + (1 - \Delta) \, \mathcal{N}(-\mu, \Sigma), \qquad (2)$$

where $\Delta \in \{0, 1\}$ with $P(\Delta = 1) = \pi$ specifies the cluster membership of an
observation, and $\mathcal{N}(\mu, \Sigma)$ stands for a Gaussian random variable with mean
$\mu = (\mu[1], \ldots, \mu[p]) \in \mathbb{R}^p$ and covariance matrix $\Sigma$. We specifically consider
$\pi = \frac{1}{2}$ and $\Sigma = I_{p \times p}$; this is a simple case which yields some insight into the
feature selection ability of CF. We start with a few definitions.

Definition. Let $h : \mathbb{R}^p \to \{0, 1\}$ be a decision rule. Let $\Delta$ be the cluster
membership for observation $X$. A loss function associated with $h$ is defined as

$$l(h(X), \Delta) = \mathbb{I}\{h(X) \ne \Delta\}. \qquad (3)$$

The optimal clustering rule under (3) is defined as

$$h^* = \arg\min_{h \in \mathcal{G}} \mathbb{E}\, l(h(X), \Delta), \qquad (4)$$

where $\mathcal{G} = \{h : \mathbb{R}^p \to \{0, 1\}\}$ and the expectation is taken with respect to the
random vector $(X, \Delta)$.

Definition [32]. For a probability measure $Q$ on $\mathbb{R}^d$ and a finite set $A \subset \mathbb{R}^d$,
define the within cluster sum of distances by

$$\Phi(A, Q) = \int \min_{a \in A} \phi(\|x - a\|) \, Q(dx), \qquad (5)$$

where $\phi(\|x - a\|)$ defines the distance between points $x$ and $a \in A$. $K$-means
clustering seeks to minimize $\Phi(A, Q)$ over a set $A$ with at most $K$ elements. We
focus on the case $\phi(x) = x^2$, $K = 2$ and refer to $\{\mu_0^*, \mu_1^*\} = \arg\min_A \Phi(A, Q)$ as
the population cluster centers.

Definition. The $i$-th feature is called a noise feature if $\mu[i] = 0$, where $\mu[i]$
denotes the $i$-th coordinate of $\mu$. A feature is "strong" ("weak") if $|\mu[i]|$ is "large"
("small").

Theorem 1. Assume the clusters are generated by Gaussian mixture (2) with
$\Sigma = I$ and $\pi = \frac{1}{2}$. Assume one feature is considered at each step and duplicate
features are excluded. Let $I \ne \emptyset$ be the set of features currently in the clustering
vector and let $f_n$ be a noise feature such that $f_n \notin I$. If $\sum_{i \in I} (\mu_0^*[i] - \mu_1^*[i])^2 > 0$,
then $\kappa(I) < \kappa(\{I, f_n\})$.

Remark. The interpretation of Theorem 1 is that noise features are generally
not included in cluster vectors under the CF procedure; thus, CF with the
$\kappa$ criterion is noise-resistant.

The proof of Theorem 1 is in the appendix. It proceeds by explicitly calculating
$SS_B$ and $SS_W$ (see Section 7.2) and thus an expression for $\kappa = SS_W / SS_B$.
The calculation is facilitated by the equivalence, under $\pi = \frac{1}{2}$ and $\Sigma = I$, of
$K$-means clustering and the optimal clustering rule $h^*$ under loss function (3).

4.2 Quantifying the mis-clustering rate 

Recall that spectral clustering works on a weighted similarity graph $\mathcal{G}(V, P)$
where $V$ is formed by a set of data points $X_i$, $i = 1, \ldots, n$, and $P$ encodes their
pairwise similarities. Spectral clustering algorithms compute the eigendecomposition
of the Laplacian matrix $\mathcal{L}(P) = D^{-1/2}(I - P)D^{-1/2}$ where $D$ is a diagonal
matrix with diagonals being degrees of $\mathcal{G}$. Different notions of similarity and
ways of using the spectral decomposition lead to different spectral cluster
algorithms [SH EH |30l EH EH Sg. In particular, Ncut [34] forms a bipartition of
the data according to the sign of the components of the second eigenvector (i.e.,
corresponding to the second smallest eigenvalue) of $\mathcal{L}(P)$. On each of the two
partitions, Ncut is then applied recursively until a stopping criterion is met.

There has been relatively little theoretical work on spectral clustering; ex- 
ceptions include [H |30l EH IMl ESI [U |40] . Here we analyze the mis-clustering rate 
for symmetrically normalized spectral clustering. For simplicity we consider the 
case of two clusters under a perturbation model. 

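Assume that the similarity (affinity) matrix can be written as

$$P = \bar{P} + \varepsilon, \qquad (6)$$

where

$$\bar{P}_{ij} = \begin{cases} 1 - \nu, & \text{if } i, j \le n_1 \text{ or } i, j > n_1, \\ \nu, & \text{otherwise,} \end{cases}$$

and $\varepsilon = (\varepsilon_{ij})_1^n$ is a symmetric random matrix such that $\mathbb{E}\varepsilon_{ij} = 0$. Here $n_1$ and
$n_2$ are the sizes of the two clusters. Let $n_2 = \gamma n_1$ and $n = n_1 + n_2$. Without
loss of generality, assume $\gamma \le 1$. Similar models have been studied in earlier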
work; see, for instance, [301 [211 [21 E]. Our focus is different; we aim at the
mis-clustering rate due to perturbation. Such a model is appropriate for modeling
the similarity (affinity) matrix produced by CF. For example, Figure 1 shows
the affinity matrix produced by CF on the Soybean dataset [5]; this matrix is
nearly block-diagonal with each block corresponding to data points from the
same cluster (there are 4 clusters in total) and the off-diagonal elements are
mostly "close" to 0. Thus a perturbation model such as (6) is a "good" approximation
to the similarity matrix produced by CF and allows us to gain insights into the
nature of CF.




Figure 1: The affinity matrix produced by CF for the Soybean dataset with 4 
clusters. The number of clustering vectors in the ensemble is 100. 

Let $\mathcal{M}$ be the mis-clustering rate, i.e., the proportion of data points assigned
to a wrong cluster (i.e., $h(X) \ne \Delta$). Theorem 2 characterizes the expected value
of $\mathcal{M}$ under perturbation model (6).

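Theorem 2. Assume that $\varepsilon_{ij}$, $i \ge j$, are mutually independent $\mathcal{N}(0, \sigma^2)$. Let
$0 < \nu < \gamma \le 1$. Then

$$\lim_{n \to \infty} \frac{1}{n} \log(\mathbb{E}\mathcal{M}) = -\frac{\gamma^2}{2\sigma^2 (1 + \gamma)(1 + \gamma^3)}. \qquad (7)$$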

The proof is in the appendix. The main step is to obtain an analytic
expression for the second eigenvector of $\mathcal{L}(P)$. Our approach is based on matrix
perturbation theory [25], and the key idea is as follows.

Let $\mathcal{P}(A)$ denote the eigenprojection of a linear operator $A : \mathbb{R}^n \to \mathbb{R}^n$.
Then $\mathcal{P}(A)$ can be expressed explicitly as the following contour integral

$$\mathcal{P}(A) = \frac{1}{2\pi i} \oint_{\Gamma} (\zeta I - A)^{-1} \, d\zeta, \qquad (8)$$

where $\Gamma$ is a simple Jordan curve enclosing the eigenvalues of interest (i.e., the
first two eigenvalues of matrix $\mathcal{L}(P)$) and excluding all others. The eigenvectors
of interest can then be obtained by

$$\varphi_i = \mathcal{P}\omega_i, \quad i = 1, 2, \qquad (9)$$

where $\omega_i$, $i = 1, 2$, are fixed linearly independent vectors in $\mathbb{R}^n$. An explicit
expression for the second eigenvector can then be obtained under perturbation
model (6), which we use to calculate the final mis-clustering rate.

Remarks. While formula (7) is obtained under some simplifying assumptions,
it provides insights into the nature of spectral clustering.

1) The mis-clustering rate increases as $\sigma$ increases.

2) By checking the derivative, the right hand side of (7) is a unimodal function
of $\gamma$, minimized at $\gamma = 1$ with a fixed $\sigma$. Thus the mis-clustering rate
decreases as the cluster sizes become more balanced.

3) When $\gamma$ is very small, i.e., the clusters are extremely unbalanced, spectral
clustering is likely to fail.

These results are consistent with existing empirical findings. In particular, they
underscore the important role played by the ratio of the two cluster sizes, $\gamma$, in the
mis-clustering rate. Additionally, our analysis (in the proof of Theorem 2) also
implies that the best cutoff value (when assigning cluster membership based on
the second eigenvector) is not exactly 0 but shifts slightly towards the center
of those components of the second eigenvector that correspond to the smaller
cluster. A closely related work is [21], which studies an end-to-end perturbation;
the final mis-clustering rate obtained there is approximate in nature. Theorem 2 is based
on a perturbation model for the affinity matrix and provides, for the first time,
a closed-form expression for the mis-clustering rate of spectral clustering under
such a model.

5 Experiments 

Two sets of experiments are performed, one on synthetic data, specifically de- 
signed to demonstrate the feature selection and "noise-resistance" capability of 
CF, and the other on several real-world datasets |3] where we compare the over- 
all clustering performance of CF with several competitors under two different 
metrics. These experiments are presented in separate subsections. 





5.1 Feature selection capability of CF

In this subsection, we describe three simulations that aim to study the feature
selection capability and "noise-resistance" feature of CF. Assume the underlying
data are generated i.i.d. by Gaussian mixture (2).

In the first simulation, a sample of 4000 observations is generated from
(2) with $\mu = (0, 0, 1, 2, 100)^T$ and the diagonals of $\Sigma$ are all 1 while the
non-diagonals are i.i.d. uniform from [0, 0.5] subject to symmetry and positive
definiteness of $\Sigma$. Denote this dataset as $G_1$. At each step of cluster growing
one feature is sampled from $\mathcal{F}$ and tested to see if it is to be included in the
clustering vector by the $\kappa$ criterion. We run the clustering vector growth procedure
until all features have been attempted with duplicate features excluded. 100
clustering vectors are generated. In Figure 2, all but one of the 100 clustering
vectors include at least one feature from the top 3 features (ranked according
to the value of $|\mu[i]|$) and all clustering vectors contain at least one of the top 5
features.




Figure 2: The occurrence of individual features in the 100 clustering vectors for
$G_1$. The left plot shows the features included (indicated by a solid circle) in
each clustering vector. Each horizontal line corresponds to a clustering vector. 
The right plot shows the total number of occurrences of each feature. 

We also performed a simulation with "noisy" data. In this simulation, data
are generated from (2) with $\Sigma = I$, the identity matrix, such that the first 100
coordinates of $\mu$ are 0 and the next 20 are generated i.i.d. uniformly from [0, 1].
We denote this dataset as $G_2$. Finally, we also considered an extreme case where
data are generated from (2) with $\Sigma = I$ such that the first 1000 features are
noise features and the remaining 20 are useful features (with coordinates of $\mu$
from ±1 to ±20); this is denoted as $G_3$. The occurrences of individual features
for $G_2$ and $G_3$ are shown in Figure 3. Note that the two plots in Figure 3 are
produced by invoking feature competition with $q = 20$ and $q = 50$, respectively.
It is worthwhile to note that, for both $G_2$ and $G_3$, despite the fact that a
majority of features are pure noise (100 out of a total of 120 for $G_2$ or 1000
out of 1020 for $G_3$, respectively), CF achieves clustering accuracies (computed
against the true labels) that are very close to the Bayes rates (about 1).
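For readers who want to reproduce a dataset of this kind, the following sketch draws samples from the symmetric two-component mixture (2) with identity covariance, using a mean vector shaped like the one described for $G_2$ (100 noise coordinates followed by 20 coordinates drawn uniformly from [0, 1]). The function name, the sample size of 200, and the random seed are illustrative choices, not values from the experiments.

```python
import random


def sample_mixture(n, mu, pi=0.5, rng=None):
    """Draw n points from Delta*N(mu, I) + (1 - Delta)*N(-mu, I).

    Returns the data and the latent cluster memberships Delta.
    """
    rng = rng or random.Random()
    data, memberships = [], []
    for _ in range(n):
        delta = 1 if rng.random() < pi else 0
        sign = 1.0 if delta == 1 else -1.0
        # Identity covariance: coordinates are independent with unit variance.
        data.append([rng.gauss(sign * m, 1.0) for m in mu])
        memberships.append(delta)
    return data, memberships


rng = random.Random(0)
# 100 noise coordinates (mean 0) followed by 20 informative coordinates.
mu = [0.0] * 100 + [rng.uniform(0.0, 1.0) for _ in range(20)]
X, delta = sample_mixture(200, mu, rng=rng)
```

On the noise coordinates both components have mean 0, so those features carry no information about cluster membership.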
Figure 3: The occurrence of individual features in the 100 clustering vectors
for $G_2$ and $G_3$. The left plot is for $G_2$ where the first 100 features are noise
features. The right plot is for $G_3$ where the first 1000 features are noise ones.

5.2 Experiments on UC Irvine datasets 

We conducted experiments with six UC Irvine datasets jS], the Soybean, SPECT
Heart, image segmentation (ImgSeg), Heart, Wine and Wisconsin breast cancer
(WDBC) datasets. A summary is provided in Table 1. Note that true labels are
available for all six datasets. We use the labels to evaluate the performance of
the clustering methods, while recognizing that this evaluation is only partially
satisfactory. The evaluation is based on two different performance metrics for
clustering, $\rho_r$ and $\rho_c$, to be defined in the following.



Dataset    Features   Classes   Instances
Soybean    35         4         47
SPECT      22         2         267
ImgSeg     19         7         2100
Heart      13         2         270
Wine       13         3         178
WDBC       30         2         569

Table 1: A summary of datasets.



Definition. One measure of the quality of a cluster ensemble is given by

$$\rho_r = \frac{\text{Number of correctly clustered pairs}}{\text{Total number of pairs}} \times 100\%,$$

where by "correctly clustered pair" we mean two instances have the same
co-cluster membership (that is, they are in the same cluster) by both CF and the
labels in the original dataset.
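A pair-counting metric of this kind can be sketched as below. The sketch adopts one reading of the definition, namely that a pair is "correctly clustered" when the clustering and the true labels agree on whether the two instances belong together (the Rand-index convention). This reading is an assumption; the function name `rho_r` is illustrative.

```python
from itertools import combinations


def rho_r(pred, truth):
    """Percentage of pairs on which pred and truth agree about co-membership.

    Assumption: agreement is counted both for pairs placed together in both
    labelings and for pairs separated in both (Rand-index style).
    """
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (truth[i] == truth[j]) for i, j in pairs)
    return 100.0 * agree / len(pairs)


same_up_to_renaming = rho_r([0, 0, 1, 1], [1, 1, 0, 0])
crossed = rho_r([0, 0, 1, 1], [0, 1, 0, 1])
```

Because only co-membership is compared, the metric is invariant to how the clusters are named.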

Definition. Another performance metric is the clustering accuracy. Let
$z = \{1, 2, \ldots, J\}$ denote the set of class labels, and $\theta(\cdot)$ and $f(\cdot)$ the true label
and the label obtained by a clustering algorithm, respectively. The clustering
accuracy is defined as

$$\rho_c(f) = \max_{\tau \in \Pi_z} \left\{ \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{\tau(f(X_i)) = \theta(X_i)\} \right\}, \qquad (10)$$

where $\mathbb{I}$ is the indicator function and $\Pi_z$ is the set of all permutations on the
label set $z$. This measure is a natural extension of the classification accuracy
(under 0-1 loss) and has been used by a number of works in clustering, e.g.,
[27], [41].
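The clustering accuracy can be computed directly from its definition by enumerating label permutations, as in the sketch below. The function name `rho_c` and the argument layout are illustrative; exhaustive enumeration costs $J!$ evaluations, which is acceptable for the small label sets used here but would call for an assignment solver when $J$ is large.

```python
from itertools import permutations


def rho_c(pred, truth, labels):
    """Best fraction of matches over all relabelings of the predicted clusters."""
    n = len(pred)
    best = 0.0
    for perm in permutations(labels):
        relabel = dict(zip(labels, perm))  # one candidate permutation tau
        hits = sum(relabel[p] == t for p, t in zip(pred, truth))
        best = max(best, hits / n)
    return best


perfect = rho_c([0, 0, 1, 1, 2], [1, 1, 0, 0, 2], labels=[0, 1, 2])
partial = rho_c([0, 0, 0, 1], [0, 0, 1, 1], labels=[0, 1])
```

The first call returns a perfect score because the two labelings differ only by swapping the names of clusters 0 and 1; in the second, one of the four points is misplaced under the best relabeling.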

The idea of having two different performance metrics is to assess a clustering
algorithm from different perspectives since one metric may particularly favor
certain aspects while overlooking others. For example, in our experiment we
observe that, for some datasets, some clustering algorithms (e.g., RP or EA)
achieve a high value of $\rho_r$ but a small $\rho_c$ on the same clustering instance (note
that $\rho_r$ and $\rho_c$ as reported in Table 2 and Table 3 may be calculated under
different parameter settings).



Dataset    CF      RP      bC2     EA
Soybean    92.36   87.04   83.16   86.48
SPECT      56.78   49.89   50.61   51.04
ImgSeg     79.71   85.88   82.19   85.75
Heart      56.90   52.41   51.50   53.20
Wine       79.70   71.94   71.97   71.86
WDBC       79.66   74.89   74.87   75.04

Table 2: $\rho_r$ for different datasets and methods (CF calculated when $q = 1$).



Dataset    CF      RP      bC2     EA
Soybean    84.43   71.83   72.34   76.59
SPECT      68.02   61.11   56.28   56.55
ImgSeg     48.24   47.69   49.91   51.31
Heart      68.26   60.54   59.10   59.26
Wine       79.19   70.79   70.22   70.22
WDBC       88.70   85.41   85.38   85.41

Table 3: $\rho_c$ for different datasets and methods (CF calculated when $q = 1$).

We compare CF with three other cluster ensemble algorithms: bagged
clustering (bC2, [11]), random projection (RP, [12]), and evidence accumulation
(EA, [15]). We made slight modifications to the original implementations to
standardize our comparison. These include adopting $K$-means clustering
($K$-medoids is used for bC2 in [11] but differs very little from $K$-means on the
datasets we have tried) to be the base clustering algorithm, and changing the
agglomerative algorithm used in RP to be based on single linkage in order to match
the implementation in EA.

We now list the parameters used in our implementation. Define the number
of initial clusters, $n_b$, to be that of clusters in running the base clustering
algorithm; denote the number of final clusters by $n_f$. In CF, the scaling parameter
$\beta_1$ is set to be 10 (i.e., 0.1 times the ensemble size); the thresholding level $\beta_2$
is 0.4 (we find very little difference in performance by setting $\beta_2 \in [0.3, 0.5]$);
the number of features, $b$, sampled each time in growing a clustering vector is
2; we set $\tau_m = 3$ and $n_b = n_f$. In RP, the dimension of the target subspace
for random projection is searched from 5 onwards and we set $n_b = n_f$. EA [14]
suggests using $\sqrt{n}$ ($n$ being the sample size) for $n_b$. This sometimes leads to
unsatisfactory results (which is the case for all except two of the datasets) and if
that happens we replace it with $n_f$. In EA, the threshold value for the single
linkage algorithm is searched through {0.3, 0.4, 0.5, 0.6, 0.7, 0.75} as suggested
by [13]. In bC2, we set $n_b = n_f$ according to [11].

Table 2 and Table 3 show the values of $\rho_r$ and $\rho_c$ achieved on the six UC
Irvine datasets using RP, bC2, EA and CF. The ensemble size is 100 and results
are averaged over 100 runs. We take $q = 1$ for CF in producing these two tables.
We see that CF compares favorably to its competitors; it yields the largest $\rho_r$
(or $\rho_c$) for all but one of the six datasets and the performance gain is substantial
in most cases (i.e., 4 out of 6).

We also explore the feature competition mechanism in the very initial round
of CF (cf. Section 2.1). According to Theorem 1, in cases where there are many
noise features or weak features, feature competition will decrease the chance of
obtaining a weak clustering instance, hence a boost in the ensemble performance
can be expected. In Table 4 and Table 5, we report results by varying the value
of $q$ in the feature competition step.

We define the feature profile plot to be the histogram of the strengths of each
individual feature, where feature strength is defined as the $\kappa$ value computed on
the dataset using this feature alone. (For a categorical variable, when the number
of categories on this variable is smaller than the number of clusters, the strength
of this feature is sampled at random from the set of strengths of other features.)
Figure 4 is the feature profile plot of the six UC Irvine datasets used in our
experiment. A close inspection of results presented in Table 4 and Table 5
shows that this plot can roughly guide us in choosing a $q$ that leads to "good"
performance for each individual dataset. We thus propose the following rule of
thumb based on the feature profile plot: use large $q$ when there are many weak
or noise features; otherwise use small $q$ or no feature competition at all.

6 Conclusion 

We have proposed a new method for ensemble-based clustering. Our experiments
show that CF compares favorably to existing clustering ensemble methods,
including bC2, evidence accumulation and RP. We have provided supporting
theoretical analysis, showing that CF with $\kappa$ is "noise-resistant" under a
simplified model. We also obtain a closed-form formula for the mis-clustering
rate of spectral clustering which yields new insights into the nature of spectral
clustering; in particular, it underscores the importance of the relative size of
clusters to the performance of spectral clustering.

Figure 4: The feature profile plot for the 6 UC Irvine datasets.

q     Soybean   SPECT   ImgSeg   Heart   Wine    WDBC
1     92.36     56.78   79.71    56.90   79.70   79.93
2     92.32     57.39   77.62    60.08   74.02   79.94
3     94.42     57.24   77.51    62.51   72.16   79.54
5     93.89     57.48   81.17    63.56   71.87   79.41
10    93.14     56.54   82.69    63.69   71.87   78.90
15    94.54     55.62   83.10    63.69   71.87   78.64
20    94.74     52.98   82.37    63.69   71.87   78.50

Table 4: The $\rho_r$ achieved by CF for $q \in \{1, 2, 3, 5, 10, 15, 20\}$. Results averaged
over 100 runs. Note the first row is taken from Table 2.

q     Soybean   SPECT   ImgSeg   Heart   Wine    WDBC
1     84.43     68.02   48.24    68.26   79.19   88.70
2     84.91     68.90   43.41    72.20   72.45   88.71
3     89.85     68.70   41.12    74.93   70.52   88.45
5     89.13     68.67   47.92    76.13   70.22   88.37
10    88.40     66.99   49.77    76.30   70.22   88.03
15    90.96     65.15   49.65    76.30   70.22   87.87
20    91.91     60.87   52.79    76.30   70.22   87.75

Table 5: The $\rho_c$ achieved by CF for $q \in \{1, 2, 3, 5, 10, 15, 20\}$. Results averaged
over 100 runs. Note the first row is taken from Table 3.

7 Appendix 

In this appendix, Section 7.2 and Section 7.3 are devoted to the proofs of
Theorem 1 and Theorem 2, respectively. Section 7.1 deals with the equivalence,
in the population, of the optimal clustering rule (as defined by equation (4)
in Section 4.1 of the main text) and K-means clustering. This prepares for the
proof of Theorem 1 and is of independent interest (e.g., it may help explain
why K-means clustering may be competitive on certain datasets in practice).

7.1 Equivalence of K-means clustering and the optimal
clustering rule for a mixture of spherical Gaussians

We first state and prove an elementary lemma for completeness. 
Lemma 1. For the Gaussian mixture model defined by (2) (Section 4.1) with
$\Sigma = I$ and $\pi = 1/2$, in the population the decision rule induced by
K-means clustering (in the sense of Pollard) is equivalent to the optimal rule
$h^*$ as defined in (4) (Section 4.1).







Figure 5: The optimal rule $h^*$ and the K-means rule. In the left panel, the
decision boundary (the thick line) by $h^*$ and that by K-means completely
overlap for a 2-component Gaussian mixture with $\Sigma = cI$. The stars in the
figure indicate the population cluster centers by K-means. The right panel
illustrates the optimal rule $h^*$ and the decision rule by K-means, where
K-means compares $\|X - \mu_0^*\|$ against $\|X - \mu_1^*\|$ while $h^*$
compares $\|H - \mu_0\|$ against $\|H - \mu_1\|$.

Proof. The geometry underlying the proof is shown in Figure 5. Let
$\mu_0, \Sigma_0$ and $\mu_1, \Sigma_1$ be associated with the two mixture
components in (2). By shift-invariance and rotation-invariance (a rotation is
an orthogonal transformation, which preserves cluster membership for
distance-based clustering), we can reduce to the $\mathbb{R}^1$ case such that
$\mu_0 = (\mu_0[1], 0, \ldots, 0) = -\mu_1$ with $\Sigma_0 = \Sigma_1 = I$. The
rest of the argument follows from geometry and the definition of K-means
clustering, which assigns $X \in \mathbb{R}^p$ to class 1 if
$\|X - \mu_1^*\| < \|X - \mu_0^*\|$, and the optimal rule $h^*$, which
determines $X \in \mathbb{R}^p$ to be in class 1 if

\[ (\mu_1 - \mu_0)^T \left( X - \frac{\mu_0 + \mu_1}{2} \right) > 0, \]

or equivalently,

\[ \|X - \mu_1\| < \|X - \mu_0\|. \]

□
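
The "or equivalently" step rests on the identity $\|x - \mu_0\|^2 - \|x - \mu_1\|^2 = 2(\mu_1 - \mu_0)^T (x - (\mu_0 + \mu_1)/2)$. The short check below evaluates both sides on random points. It is only a numerical illustration: the means, the dimension and the helper names `margin` and `sq_dist` are arbitrary choices made for this sketch.

```python
import random


def margin(x, mu0, mu1):
    """(mu1 - mu0)^T (x - (mu0 + mu1)/2); the optimal rule picks class 1 when positive."""
    return sum((b - a) * (xi - (a + b) / 2.0) for xi, a, b in zip(x, mu0, mu1))


def sq_dist(x, mu):
    """Squared Euclidean distance between x and mu."""
    return sum((xi - m) ** 2 for xi, m in zip(x, mu))


rng = random.Random(0)
mu0, mu1 = [-1.5, 0.0, 0.0], [1.5, 0.0, 0.0]  # symmetric means, as in the reduction
max_gap = 0.0
for _ in range(1000):
    x = [rng.gauss(0.0, 2.0) for _ in range(3)]
    # Left side: difference of squared distances; right side: twice the margin.
    gap = abs((sq_dist(x, mu0) - sq_dist(x, mu1)) - 2.0 * margin(x, mu0, mu1))
    max_gap = max(max_gap, gap)
```

Since the two sides agree up to rounding, the sign of the margin and the comparison of distances always select the same class.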

7.2 Proof of Theorem 1 

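Proof of Theorem 1. Let $C_1$ and $C_0$ denote the two clusters obtained by
K-means clustering. Let $G$ be the distribution function of the underlying
data. $SS_W$ and $SS_B$ can be calculated as follows.

\[ SS_W = \frac{1}{2} \int_{x \neq y \in C_1} \|x - y\|^2 \, dG(x) \, dG(y)
        + \frac{1}{2} \int_{x \neq y \in C_0} \|x - y\|^2 \, dG(x) \, dG(y)
        = (\sigma^*)^2, \]

\[ SS_B = \int_{x \in C_1} \int_{y \in C_0} \|x - y\|^2 \, dG(y) \, dG(x)
        = (\sigma^*)^2 + \frac{1}{4} \|\mu_0^* - \mu_1^*\|^2. \]

If we assume $\Sigma = I_{p \times p}$ is always true during the growth of the
clustering vector (this holds if duplicated features are excluded), then

\[ \frac{1}{\kappa} = \frac{SS_B}{SS_W}
   = 1 + \frac{\|\mu_0^* - \mu_1^*\|^2}{4 (\sigma^*)^2}. \qquad (11) \]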

Without loss of generality, let $I = \{1, 2, \ldots, d-1\}$ and let the noise
feature be the $d$-th feature. By the equivalence, in the population, of
K-means clustering and the optimal clustering rule $h^*$ (Lemma 1 in Section
7.1) for a mixture of two spherical Gaussians, K-means clustering assigns
$x \in \mathbb{R}^d$ to $C_1$ if

\[ \|x - \mu_1\| < \|x - \mu_0\|, \]

which is equivalent to

\[ \sum_{i=1}^{d-1} (x[i] - \mu_1[i])^2 < \sum_{i=1}^{d-1} (x[i] - \mu_0[i])^2. \qquad (12) \]

This is true since $\mu_0[d] = \mu_1[d]$ by the assumption that the $d$-th
feature is a noise feature. This in turn implies that the last coordinates of
the population cluster centers for $C_1$ and $C_0$ are the same, that is,
$\mu_1^*[d] = \mu_0^*[d]$, because, by definition,
$\mu_i^*[j] = \int_{x \in C_i} x[j] \, dG(x)$ for $i = 0, 1$ and
$j = 1, \ldots, d$. Therefore adding a noise feature does not affect
$\|\mu_0^* - \mu_1^*\|$. However, the addition of a noise feature increases
the value of $(\sigma^*)^2$; it follows that $\kappa$ will be increased by
adding a noise feature. □
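
The effect described in this proof can be illustrated with a small simulation. This is a sketch under stated assumptions, not the construction used in the paper: it uses centroid-based empirical sums of squares for SS_W and SS_B (the proof works with population integrals), takes $\kappa$ = SS_W / SS_B, and assigns points by the nearest-mean rule on the informative features. The sample size, means and the helper name `kappa` are hypothetical.

```python
import random


def kappa(points, labels):
    """Assumed empirical quality measure SS_W / SS_B with centroid-based sums."""
    dim = len(points[0])
    grand = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    ss_w = ss_b = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        ss_w += sum(sum((p[j] - center[j]) ** 2 for j in range(dim)) for p in members)
        ss_b += len(members) * sum((center[j] - grand[j]) ** 2 for j in range(dim))
    return ss_w / ss_b


rng = random.Random(1)
mu0, mu1 = [-2.0, 0.0], [2.0, 0.0]
points, labels = [], []
for i in range(2000):
    mu = mu1 if i % 2 else mu0
    x = [m + rng.gauss(0.0, 1.0) for m in mu]
    points.append(x)
    # Nearest-mean assignment, i.e. the rule in (12) on the informative features.
    d0 = sum((a - b) ** 2 for a, b in zip(x, mu0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, mu1))
    labels.append(1 if d1 < d0 else 0)

kappa_before = kappa(points, labels)
noisy = [p + [rng.gauss(0.0, 1.0)] for p in points]  # append a pure-noise coordinate
kappa_after = kappa(noisy, labels)
```

The noise coordinate adds to the within-cluster sum of squares while leaving the distance between cluster centers essentially unchanged, so `kappa_after` exceeds `kappa_before`.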

7.3 Proof of Theorem 2 

Proof of Theorem 2. To simplify the presentation, some lemmas used here
(Lemma 2 and Lemma 3) are stated after this proof.
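It can be shown that

\[ D^{-1/2} P D^{-1/2} = \sum_{i=1}^{2} \lambda_i x_i x_i^T, \]

where $\lambda_i$ are the eigenvalues and $x_i$ the eigenvectors, $i = 1, 2$,
such that for $\nu = o(1)$,

\[ \lambda_1 = 1 + O_p(\nu^2 + n^{-1}), \]

\[ \lambda_2 = 1 - \gamma^{-1} (1 + \gamma^2) \nu + O_p(\nu^2 + n^{-1}), \]

and

\[ (n_1 \gamma^{-1} + n_2)^{1/2} x_1[i] =
   \begin{cases}
     \gamma^{-1/2} + O_p(\nu + n^{-1/2}), & \text{if } i \le n_1, \\
     1 + O_p(\nu + n^{-1/2}), & \text{otherwise,}
   \end{cases} \]

\[ (n_1 \gamma^{3} + n_2)^{1/2} x_2[i] =
   \begin{cases}
     -\gamma^{3/2} + O_p(\nu + n^{-1/2}), & \text{if } i \le n_1, \\
     1 + O_p(\nu + n^{-1/2}), & \text{otherwise.}
   \end{cases} \]

By Lemma 3 we have $\|\varepsilon\|_2 = O_p(\sqrt{n})$ and thus the $i$-th
eigenvalues of $D^{-1/2} (P + \varepsilon) D^{-1/2}$ for $i \ge 3$ are of order
$O_p(n^{-1/2})$. Note that, in the above, all residual terms are uniformly
bounded w.r.t. $n$ and $\nu$.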
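
The leading terms of the two eigenvalues can be checked numerically on a noise-free example. The sketch below assumes that $P$ has entries 1 within each cluster and $\nu$ between clusters, with $\gamma = n_2 / n_1$ and $D$ the diagonal matrix of row sums of $P$. That block structure is an assumption made for this illustration (the model itself is defined in Section 4.2), and `reduced_eigenvalues` is a hypothetical helper.

```python
import math


def reduced_eigenvalues(n1, n2, nu):
    """Nonzero eigenvalues of D^{-1/2} P D^{-1/2} for the assumed two-block P."""
    d1 = n1 + nu * n2  # row sum for a point in the first cluster
    d2 = nu * n1 + n2  # row sum for a point in the second cluster
    # Restricted to the span of the normalized cluster indicators,
    # the matrix is the symmetric 2 x 2 matrix [[a, c], [c, b]].
    a = n1 / d1
    b = n2 / d2
    c = nu * math.sqrt(n1 * n2) / math.sqrt(d1 * d2)
    tr, det = a + b, a * b - c * c
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (tr + disc) / 2.0, (tr - disc) / 2.0


n1, n2, nu = 100, 200, 0.01
gamma = n2 / n1
lam1, lam2 = reduced_eigenvalues(n1, n2, nu)
first_order = 1.0 - (1.0 + gamma ** 2) / gamma * nu
```

In this example `lam1` equals 1 up to rounding and `lam2` differs from `first_order` only at order $\nu^2$, in line with the expansions above.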

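Let

\[ \psi = \frac{1}{2 \pi i} \int_\Gamma
   \left( tI - D^{-1/2} (P + \varepsilon) D^{-1/2} \right)^{-1} dt, \]

where $\Gamma$ is a Jordan curve enclosing only the first two eigenvalues.
Then, by (8) and (9) in the main text (see Section 4.2), $\psi x_2$ is the
second eigenvector of $D^{-1/2} (P + \varepsilon) D^{-1/2}$ and the
mis-clustering rate is given by

\[ M = \frac{1}{n} \left[ \sum_{i \le n_1} I\big( (\psi x_2)[i] > 0 \big)
     + \sum_{i > n_1} I\big( (\psi x_2)[i] < 0 \big) \right]. \]

Thus

\[ EM = \frac{1}{1 + \gamma} \left[ P\big( (\psi x_2)[1] > 0 \big)
      + \gamma P\big( (\psi x_2)[n] < 0 \big) \right]. \]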



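By Lemma 2 and letting $\tilde{\varepsilon} = D^{-1/2} \varepsilon D^{-1/2}$,
we have

\begin{align*}
\psi x_2
 &= \frac{1}{2 \pi i} \int_\Gamma
    \left( tI - D^{-1/2} P D^{-1/2} - \tilde{\varepsilon} \right)^{-1} x_2 \, dt \\
 &= \frac{1}{2 \pi i} \int_\Gamma
    \left[ I - (tI - D^{-1/2} P D^{-1/2})^{-1} \tilde{\varepsilon} \right]^{-1}
    (tI - D^{-1/2} P D^{-1/2})^{-1} x_2 \, dt \\
 &= \frac{1}{2 \pi i} \int_\Gamma
    \left[ I - (tI - D^{-1/2} P D^{-1/2})^{-1} \tilde{\varepsilon} \right]^{-1}
    x_2 \, (t - \lambda_2)^{-1} \, dt \\
 &= \phi x_2 + O_p(n^{-2}),
\end{align*}

where

\[ \phi x_2 = \frac{1}{2 \pi i} \int_\Gamma
   \left[ I + (tI - D^{-1/2} P D^{-1/2})^{-1} \tilde{\varepsilon} \right]
   x_2 \, (t - \lambda_2)^{-1} \, dt. \]

It can be shown that, by the Cauchy Integral Theorem [33] and Lemma 2,

\begin{align*}
\phi x_2
 &= x_2 + \frac{1}{2 \pi i} \int_\Gamma
    (tI - D^{-1/2} P D^{-1/2})^{-1} \tilde{\varepsilon} x_2 \,
    (t - \lambda_2)^{-1} \, dt \\
 &= x_2 - \lambda_2^{-1} \tilde{\varepsilon} x_2 + O_p(n^{-2}).
\end{align*}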
Let $\varepsilon_i$ be the $i$-th column of $\varepsilon$. By Slutsky's Theorem, one can verify that



£1X2 = a 



and 



Thus 



1/2-1 / I + 7^ 7^^^ 



pUo,l)>„''V'y'^^^^li^ (1+0(1)) 



and 



P((^X2)[1] > 0) = P(AA(0,l)>n2a-^/ |+^' + 



Hence

\[ \lim_{n \to \infty} \frac{1}{n} \log(EM)
   = - \frac{\gamma^2}{2 \sigma^2 (1 + \gamma)(1 + \gamma^3)}, \]

and the conclusion follows. □
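
To see what this limit says about cluster sizes, the exponent can be evaluated for a few values of $\gamma$. The sketch below reads the limit as $-\gamma^2 / (2\sigma^2 (1+\gamma)(1+\gamma^3))$ with $\gamma = n_2 / n_1$. That reading, the grid of values and the function name `rate` are assumptions of this illustration.

```python
def rate(gamma, sigma=1.0):
    """Decay exponent gamma^2 / (2 sigma^2 (1 + gamma)(1 + gamma^3)) of E[M] in n."""
    return gamma ** 2 / (2.0 * sigma ** 2 * (1.0 + gamma) * (1.0 + gamma ** 3))


grid = [0.25, 0.5, 1.0, 2.0, 4.0]
rates = {g: rate(g) for g in grid}
best = max(grid, key=rate)  # ratio at which the mis-clustering rate decays fastest
```

On this grid the exponent is largest at $\gamma = 1$ and takes the same value at $\gamma$ and $1/\gamma$. Balanced clusters therefore give the fastest decay of the mis-clustering rate, and swapping the roles of the two clusters changes nothing.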
Lemma 2. Let $P$, $x_2$, $\lambda_2$, $\psi$, $\phi$ be defined as above. Then

\[ (tI - D^{-1/2} P D^{-1/2})^{-1} x_2 = (t - \lambda_2)^{-1} x_2 \]

and

\[ \|\psi x_2 - \phi x_2\|_\infty = O_p(n^{-2}). \]

The first part follows from a direct calculation, and the proof of the second
relies on the semi-circle law. The technical details are omitted.

Lemma 3. Let $\varepsilon = \{\varepsilon_{ij}\}_{i,j=1}^{n}$ be a symmetric
random matrix with $\varepsilon_{ij} \sim \mathcal{N}(0, 1)$, independent for
$1 \le i \le j \le n$. Then

\[ \|\varepsilon\|_2 = O_p(\sqrt{n}). \]

The proof is based on the moment method (see [IB]) and the details are omitted. 
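
The $\sqrt{n}$ scaling in Lemma 3 can be observed empirically. The sketch below draws symmetric Gaussian matrices of two sizes and estimates their spectral norms by power iteration. It is an illustration rather than a proof, and the matrix sizes, iteration count and helper names are arbitrary choices.

```python
import math
import random


def symmetric_gaussian(n, rng):
    """Symmetric n x n matrix with N(0, 1) entries, independent for i <= j."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            m[i][j] = m[j][i] = rng.gauss(0.0, 1.0)
    return m


def spectral_norm(m, iters=100, seed=0):
    """Power iteration: returns ||m v|| for a unit vector v, a lower estimate of ||m||_2."""
    rng = random.Random(seed)
    n = len(m)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = 0.0
    for _ in range(iters):
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        v = [sum(row[j] * v[j] for j in range(n)) for row in m]
        norm = math.sqrt(sum(x * x for x in v))
    return norm


rng = random.Random(42)
ratios = {}
for n in (50, 100):
    ratios[n] = spectral_norm(symmetric_gaussian(n, rng)) / math.sqrt(n)
```

The ratio of the estimated norm to $\sqrt{n}$ stays bounded as $n$ changes, as the lemma asserts.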